Abstract. The classical perceptron rule provides a varying upper bound
on the maximum margin, namely the length of the current weight vector
divided by the total number of updates up to that time. Requiring
that the perceptron update its internal state whenever the normalized
margin of a pattern is found not to exceed a certain fraction of this
dynamic upper bound, we construct a new approximate maximum margin
classifier called the perceptron with dynamic margin (PDM). We demonstrate
that PDM converges in a finite number of steps and derive an upper
bound on their number. We also compare PDM experimentally with other
perceptron-like algorithms and support vector machines on hard margin
tasks involving linear kernels which are equivalent to 2-norm soft margin.
1 Introduction

It is a common belief that learning machines able to produce solution hyperplanes
with large margins exhibit greater generalization ability [21] and this justifies the
enormous interest in Support Vector Machines (SVMs) [21,2]. Typically, SVMs
obtain large margin solutions by solving a constrained quadratic optimization
problem using dual variables. In their native form, however, efficient
implementation is hindered by the quadratic dependence of their memory
requirements on the number of training examples, a fact which renders the
processing of large datasets prohibitive. To overcome this problem, decomposition
methods [15,6] were developed that apply optimization only to a subset of the
training set. Although such methods led to improved convergence rates, in
practice their superlinear dependence on the number of examples, which can be
even cubic, can still lead to excessive runtimes when large datasets are
processed. Recently, the so-called linear SVMs [7, 8, 13] made their appearance.
They take advantage of linear kernels in order to allow parts of them to be
written in primal notation and were shown to outperform decomposition SVMs
when dealing with massive datasets.

The above considerations motivated research in alternative large margin
classifiers naturally formulated in primal space long before the advent of linear
SVMs. Such algorithms are mostly based on the perceptron [16, 12], the simplest
online learning algorithm for binary linear classification. Like the perceptron,
they focus on the primal problem by updating a weight vector, which represents
at each step the current state of the algorithm, whenever a data point presented
to it satisfies a specific condition. It is the ability of such algorithms to process
one example at a time^1 that allows them to spare time and memory resources
and consequently makes them able to handle large datasets. The first algorithm
of that kind is the perceptron with margin [3], which is much older than SVMs.
It is an immediate extension of the perceptron which provably achieves solutions
with only up to 1/2 of the maximum margin [10]. Subsequently, various
algorithms succeeded in approximately attaining maximum margin by employing
modified perceptron-like update rules. Such algorithms include ROMMA [11],
ALMA [5], CRAMMA [19] and MICRA [20]. Very recently, the same goal was
accomplished by a generalized perceptron with margin, the margitron [14].

The most straightforward way of obtaining large margin solutions through
a perceptron is by requiring that the weight vector be updated every time the
example presented to the algorithm has (normalized) margin which does not
exceed a predefined value [17,18,1]. The obvious problem with this idea,
however, is that the algorithm with such a fixed margin condition will definitely
not converge unless the target value of the margin is smaller than the unknown
maximum margin. In an earlier work [14] we noticed that the upper bound
$\|a_t\|/t$ on the maximum margin, with $\|a_t\|$ being the length of the weight
vector and $t$ the number of updates, that comes as an immediate consequence of
the perceptron update rule is very accurate and tends to improve as the algorithm
achieves larger margins. In the present work we replace the fixed target margin
value with a fraction $1-\epsilon$ of this varying upper bound on the maximum
margin. The hope is that as the algorithm keeps updating its state the upper
bound will keep approaching the maximum margin and convergence to a solution
with the desired accuracy $\epsilon$ will eventually occur. Thus, the resulting
algorithm may be regarded as a realizable implementation of the perceptron with
fixed margin condition.

The rest of this paper is organized as follows. Section 2 contains some prelim- 
inaries and a motivation of the algorithm based on a qualitative analysis. In Sect. 
3 we give a formal theoretical analysis. Section 4 is devoted to implementational 
issues. Section 5 contains our experimental results while Sect. 6 our conclusions. 

2 Motivation of the Algorithm 

Let us consider a linearly separable training set $\{(x_k, l_k)\}_{k=1}^{m}$ with
vectors $x_k \in \mathbb{R}^d$ and labels $l_k \in \{+1,-1\}$. This training set may
either be the original dataset or the result of a mapping into a feature space of
higher dimensionality [21,2]. Actually, there is a very well-known construction [4]
making linear separability always possible, which amounts to the adoption of the
2-norm soft margin. By placing $x_k$ in the same position at a distance $\rho$ in an
additional dimension, i.e. by extending $x_k$ to $[x_k, \rho]$, we construct an
embedding of our data into the so-called augmented space [3]. This way, we
construct hyperplanes possessing bias in the non-augmented feature space.
Following the augmentation, a reflection with respect to the origin of the
negatively labeled patterns is performed by multiplying every pattern with its
label. This allows for a uniform treatment of both categories of patterns. Also,
$R \equiv \max_k \|y_k\|$ with $y_k \equiv [l_k x_k, l_k \rho]$ the $k$-th augmented
and reflected pattern. Obviously, $R \ge \rho$.

^1 The conversion of online algorithms to the batch setting is done by cycling repeatedly
through the dataset and using the last hypothesis for prediction.

The relation characterizing optimally correct classification of the training
patterns $y_k$ by a weight vector $u$ of unit norm in the augmented space is

$$u \cdot y_k \ge \gamma_d \equiv \max_{u' : \|u'\| = 1} \min_i \{u' \cdot y_i\} \quad \forall k . \tag{1}$$

We shall refer to $\gamma_d$ as the maximum directional margin. It coincides with the
maximum margin in the augmented space with respect to hyperplanes passing
through the origin. For the maximum directional margin $\gamma_d$ and the maximum
geometric margin $\gamma$ in the non-augmented feature space, it holds that
$1 \le \gamma/\gamma_d \le R/\rho$. As $\rho \to \infty$, $R/\rho \to 1$ and,
consequently, $\gamma_d \to \gamma$ [17, 18].
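The augmentation and reflection described above can be illustrated with a short NumPy sketch. The function names `augment_and_reflect` and `directional_margin` are hypothetical helpers introduced here for illustration; they are not part of the original work.

```python
import numpy as np

def augment_and_reflect(X, labels, rho=1.0):
    """Build y_k = [l_k x_k, l_k rho] for every row x_k of X.

    Returns the matrix Y of reflected augmented patterns and
    R = max_k ||y_k||. Hypothetical helper, for illustration only.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Augmentation: append the constant rho as an extra coordinate.
    augmented = np.hstack([X, np.full((X.shape[0], 1), rho)])
    # Reflection: multiply every augmented pattern by its label.
    Y = augmented * labels[:, None]
    R = float(np.max(np.linalg.norm(Y, axis=1)))
    return Y, R

def directional_margin(Y, u):
    """Smallest normalized margin min_k (u . y_k) / ||u|| achieved by direction u."""
    u = np.asarray(u, dtype=float)
    return float(np.min(Y @ u) / np.linalg.norm(u))

X = np.array([[2.0, 0.0], [0.0, -1.0]])
labels = np.array([1, -1])
Y, R = augment_and_reflect(X, labels, rho=1.0)
```

After reflection every pattern is treated as positively labeled, so a direction `u` classifies the training set correctly exactly when `directional_margin(Y, u)` is positive.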

We consider algorithms in which the augmented weight vector $a_t$ is initially
set to zero, i.e. $a_0 = 0$, and is updated according to the classical perceptron rule

$$a_{t+1} = a_t + y_k \tag{2}$$

each time an appropriate misclassification condition is satisfied by a training
pattern $y_k$. Taking the inner product of (2) with the optimal direction $u$ and
using (1) we get

$$u \cdot a_{t+1} - u \cdot a_t = u \cdot y_k \ge \gamma_d ,$$

a repeated application of which gives [12]

$$\|a_t\| \ge u \cdot a_t \ge \gamma_d t . \tag{3}$$

From (3) we readily obtain

$$\gamma_d \le \frac{\|a_t\|}{t} \tag{4}$$

provided $t > 0$. Notice that the above upper bound on the maximum directional
margin $\gamma_d$ is an immediate consequence of the classical perceptron rule and holds
independent of the misclassification condition.
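As a minimal sketch of how the bound (4) can be monitored in practice, the following runs the classical perceptron (misclassification condition $a_t \cdot y_k \le 0$) on reflected patterns and reports $\|a_t\|/t$. The function name `perceptron_with_bound` and the toy data are assumptions made for illustration.

```python
import numpy as np

def perceptron_with_bound(Y, max_epochs=1000):
    """Classical perceptron on reflected patterns (rows of Y).

    Updates a <- a + y_k whenever a . y_k <= 0 and returns the final
    weight vector, the number of updates t and the bound ||a_t|| / t.
    Illustrative sketch, not the authors' implementation.
    """
    a = np.zeros(Y.shape[1])
    t = 0
    for _ in range(max_epochs):
        updated = False
        for y in Y:
            if a @ y <= 0.0:
                a = a + y      # perceptron rule (2)
                t += 1
                updated = True
        if not updated:
            break
    bound = np.linalg.norm(a) / t if t > 0 else float("inf")
    return a, t, bound

# Toy reflected patterns whose maximum directional margin is 3 / sqrt(2).
Y = np.array([[2.0, 1.0], [1.0, 2.0], [3.0, 3.0]])
a, t, bound = perceptron_with_bound(Y)
achieved_margin = float(np.min(Y @ a) / np.linalg.norm(a))
```

On this toy set the margin achieved by the solution lies below $\gamma_d$, while `bound` lies above it, so the pair brackets the unknown maximum directional margin.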

It would be very desirable that $\|a_t\|/t$ approaches $\gamma_d$ with $t$ increasing since
this would provide an after-run estimate of the accuracy achieved by an
algorithm employing the classical perceptron update. More specifically, with
$\gamma'_d$ being the directional margin achieved upon convergence of the algorithm in
$t_c$ updates, it follows from (4) that

$$\frac{\gamma_d - \gamma'_d}{\gamma_d} \le 1 - \frac{\gamma'_d\, t_c}{\|a_{t_c}\|} . \tag{5}$$

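In order to understand the mechanism by which $\|a_t\|/t$ evolves we consider
the difference between two consecutive values of $\|a_t\|^2/t^2$, which may be shown
to be given by the relation

$$\frac{\|a_t\|^2}{t^2} - \frac{\|a_{t+1}\|^2}{(t+1)^2} = \frac{1}{t(t+1)} \left\{ \left( \frac{\|a_t\|^2}{t} - a_t \cdot y_k \right) + \left( \frac{\|a_{t+1}\|^2}{t+1} - a_{t+1} \cdot y_k \right) \right\} . \tag{6}$$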
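Relation (6) is a purely algebraic consequence of the update rule (2), so it can be checked numerically for arbitrary vectors. The snippet below is such a sanity check; the helper name `identity_gap` and the random inputs are chosen here for illustration only.

```python
import numpy as np

def identity_gap(a_t, y_k, t):
    """Return |lhs - rhs| of relation (6) for the update a_{t+1} = a_t + y_k."""
    a_next = a_t + y_k
    lhs = a_t @ a_t / t**2 - a_next @ a_next / (t + 1) ** 2
    rhs = (
        (a_t @ a_t / t - a_t @ y_k)
        + (a_next @ a_next / (t + 1) - a_next @ y_k)
    ) / (t * (t + 1))
    return abs(lhs - rhs)

rng = np.random.default_rng(0)
gap = identity_gap(rng.normal(size=5), rng.normal(size=5), t=7)
```

The gap vanishes up to floating-point rounding for any choice of vectors and any $t > 0$, which confirms that (6) does not depend on the misclassification condition.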




Let us assume that satisfaction of the misclassification condition by a pattern
$y_k$ has as a consequence that $\|a_t\|^2/t > a_t \cdot y_k$ (i.e., the normalized margin
$u_t \cdot y_k$ of $y_k$ (with $u_t \equiv a_t/\|a_t\|$) is smaller than the upper bound (4) on $\gamma_d$).
Let us further assume that after the update has taken place $y_k$ still satisfies
the misclassification condition and therefore $\|a_{t+1}\|^2/(t+1) > a_{t+1} \cdot y_k$. Then,
the r.h.s. of (6) is positive and $\|a_t\|/t$ decreases as a result of the update. In
the event, instead, that the update leads to violation of the misclassification
condition, $\|a_{t+1}\|^2/(t+1)$ is not necessarily larger than $a_{t+1} \cdot y_k$ and $\|a_t\|/t$
may not decrease as a result of the update. We expect that statistically, at least in
the early stages of the algorithm, most updates do not lead to correctly classified
patterns (i.e., patterns which violate the misclassification condition) and as a
consequence $\|a_t\|/t$ will have the tendency to decrease. Obviously, the rate at
which this will take place depends on the size of the difference $\|a_t\|^2/t - a_t \cdot y_k$
which, in turn, depends on the misclassification condition.

If we are interested in obtaining solutions possessing margin, the most natural
choice of misclassification condition is the fixed (normalized) margin condition

$$a_t \cdot y_k \le (1-\epsilon)\gamma_d \|a_t\| \tag{7}$$

with the accuracy parameter $\epsilon$ satisfying $0 < \epsilon \le 1$. This is an example of a
misclassification condition which, if it is satisfied, ensures that
$\|a_t\|^2/t \ge a_t \cdot y_k$. Moreover, by making use of (4) and (7) it may easily be
shown that $\|a_{t+1}\|^2/(t+1) \ge a_{t+1} \cdot y_k$ for $t \ge \epsilon^{-1}R^2/\gamma_d^2$. Thus, after
at most $\epsilon^{-1}R^2/\gamma_d^2$ updates $\|a_t\|/t$ decreases monotonically. The perceptron
algorithm with fixed margin condition (PFM) is known to converge in a finite
number of updates to an $\epsilon$-accurate approximation of the maximum directional
margin hyperplane [17, 18, 1]. Although it appears that PFM demands exact
knowledge of the value of $\gamma_d$, we notice that only the value of
$\beta \equiv (1-\epsilon)\gamma_d$, which is the quantity entering (7), needs to be set and not
the values of $\epsilon$ and $\gamma_d$ separately. That is why the after-run estimate (5) is
useful in connection with the algorithm in question. Nevertheless, in order to
make sure that $\beta < \gamma_d$, a priori knowledge of a fairly good lower bound on $\gamma_d$
is required and this is an obvious defect of PFM.

The above difficulty associated with the fixed margin condition may be
remedied if the unknown $\gamma_d$ is replaced for $t > 0$ with its varying upper bound $\|a_t\|/t$

$$a_t \cdot y_k \le (1-\epsilon)\frac{\|a_t\|^2}{t} . \tag{8}$$

Condition (8) ensures that $\|a_t\|^2/t - a_t \cdot y_k \ge \epsilon\|a_t\|^2/t > 0$. Moreover, as
in the case of the fixed margin condition, $\|a_{t+1}\|^2/(t+1) - a_{t+1} \cdot y_k \ge 0$ for
$t \ge \epsilon^{-1}R^2/\gamma_d^2$. As a result, after at most $\epsilon^{-1}R^2/\gamma_d^2$ updates the r.h.s.
of (6) is bounded from below by $\epsilon\|a_t\|^2/(t^2(t+1)) \ge \epsilon\gamma_d^2/(t+1)$ and
$\|a_t\|/t$ decreases monotonically and sufficiently fast. Thus, we expect that
$\|a_t\|/t$ will eventually approach $\gamma_d$ close enough, thereby allowing for
convergence of the algorithm to an $\epsilon$-accurate approximation of the maximum
directional margin hyperplane. It is also apparent that the decrease of $\|a_t\|/t$
will be faster for larger values of $\epsilon$.
The perceptron algorithm employing the misclassification condition (8) (with
its threshold set to 0 for $t = 0$), which may be regarded as originating from (7)
with $\gamma_d$ replaced for $t > 0$ by its dynamic upper bound $\|a_t\|/t$, will be named
the perceptron with dynamic margin (PDM).
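A minimal sketch of PDM in batch mode, cycling through the data until a full pass produces no update, might look as follows. It is written directly from condition (8) and rule (2); the function name `pdm`, the `max_epochs` safeguard and the toy data are assumptions, and none of the efficiency mechanisms of Sect. 4 are included.

```python
import numpy as np

def pdm(Y, eps, max_epochs=100000):
    """Perceptron with dynamic margin on reflected patterns (rows of Y).

    A pattern triggers an update when a . y_k <= (1 - eps) * ||a||^2 / t,
    with the threshold set to 0 for t = 0. Illustrative sketch only.
    """
    a = np.zeros(Y.shape[1])
    t = 0
    for _ in range(max_epochs):
        updated = False
        for y in Y:
            threshold = (1.0 - eps) * (a @ a) / t if t > 0 else 0.0
            if a @ y <= threshold:
                a = a + y      # classical perceptron rule (2)
                t += 1
                updated = True
        if not updated:
            return a, t        # every pattern violates condition (8)
    raise RuntimeError("no convergence within max_epochs")

Y = np.array([[2.0, 1.0], [1.0, 2.0], [3.0, 3.0]])
a, t = pdm(Y, eps=0.1)
margin = float(np.min(Y @ a) / np.linalg.norm(a))
```

On this toy set the run stops after two updates with a weight vector along $(1, 1)$, which is the maximum directional margin direction for these three patterns.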

3 Theoretical Analysis 

From the discussion that led to the formulation of PDM it is apparent that if
the algorithm converges it will achieve by construction a solution possessing
directional margin at least as large as $(1-\epsilon)\gamma_d$. (We
remind the reader that convergence assumes violation of the misclassification
condition (8) by all patterns. In addition, (4) holds.) The same obviously applies
to PFM. Thus, for both algorithms it only remains to be demonstrated that
they converge in a finite number of steps. This has already been shown for PFM
[17, 18, 1] but no general $\epsilon$-dependent bound in closed form has been derived.
Our purpose in this section is to demonstrate convergence of PDM and provide
explicit bounds for both algorithms.

Before we proceed with our analysis we will need the following result. 






Lemma 1. Let the variable $t \ge e^{-C}$ satisfy the inequality

$$t \le \delta(1 + C + \ln t) , \tag{9}$$

where $\delta$, $C$ are constants and $\delta > e^{-C}$. Then

$$t \le t_0 \equiv (1 + e^{-1})\,\delta\,\left(C + \ln\left((1 + e)\delta\right)\right) . \tag{10}$$

Proof. If $t \ge e^{-C}$ then $(1 + C + \ln t) \ge 1$ and inequality (9) is equivalent to
$f(t) = t/(1 + C + \ln t) - \delta \le 0$. For the function $f(t)$ defined in the interval
$[e^{-C}, +\infty)$ it holds that $f(e^{-C}) < 0$ and
$df/dt = (C + \ln t)/(1 + C + \ln t)^2 \ge 0$ for $t \ge e^{-C}$. Stated differently, $f(t)$
starts from negative values at $t = e^{-C}$ and increases monotonically. Therefore,
if $f(t_0) \ge 0$ then $t_0$ is an upper bound of all $t$ for which $f(t) \le 0$. Indeed, it
is not difficult to verify that $t_0 \ge \delta > e^{-C}$ and

$$f(t_0) = \delta \left\{ (1 + e^{-1}) \left( 1 + \frac{\ln\ln\left(e^{C}(1+e)\delta\right)}{\ln\left(e^{C}(1+e)\delta\right)} \right)^{-1} - 1 \right\} \ge 0$$

given that $\ln\ln x/\ln x \le e^{-1}$.

□
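Lemma 1 can be checked numerically by locating the largest $t$ that satisfies (9) and comparing it with $t_0$ from (10). The sketch below does this by bisection; the function names and the sample constants $\delta = 50$, $C = 2$ are assumptions chosen for illustration.

```python
import math

def lemma1_bound(delta, C):
    """t_0 of (10): (1 + 1/e) * delta * (C + ln((1 + e) * delta))."""
    return (1.0 + math.exp(-1.0)) * delta * (C + math.log((1.0 + math.e) * delta))

def largest_t_satisfying(delta, C, hi=1e12, iters=200):
    """Largest t with t <= delta * (1 + C + ln t), found by bisection.

    g(t) = t - delta * (1 + C + ln t) is negative at t = delta (because
    delta > exp(-C)) and increasing for t > delta, so it has a single
    root above delta. Assumes g(hi) > 0.
    """
    def g(t):
        return t - delta * (1.0 + C + math.log(t))
    lo = delta
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return lo

t_star = largest_t_satisfying(50.0, 2.0)
t_zero = lemma1_bound(50.0, 2.0)
```

For these constants the exact threshold is roughly 456 while (10) gives roughly 494, so the closed-form bound is valid and reasonably tight.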



Now we are ready to derive an upper bound on the number of steps of PFM. 



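Theorem 1. The number $t$ of updates of the perceptron algorithm with fixed
margin condition satisfies the bound

$$t \le (1+e^{-1}) \frac{R^2}{2\epsilon\gamma_d^2} \left\{ 4\frac{\gamma_d}{R}\left(1 - (1-\epsilon)\frac{\gamma_d}{R}\right) + \ln\left( \frac{(1+e)}{\epsilon}\frac{R}{\gamma_d}\left(1 - (1-\epsilon)\frac{\gamma_d}{R}\right) \right) \right\} .$$

Proof. From (2) and (7) we get

$$\|a_{t+1}\|^2 = \|a_t\|^2 + \|y_k\|^2 + 2a_t \cdot y_k \le \|a_t\|^2 \left( 1 + \frac{R^2}{\|a_t\|^2} + \frac{2(1-\epsilon)\gamma_d}{\|a_t\|} \right) .$$

Then, taking the square root and using the inequality $\sqrt{1+x} \le 1 + x/2$ we have

$$\|a_{t+1}\| \le \|a_t\| \left( 1 + \frac{R^2}{\|a_t\|^2} + \frac{2(1-\epsilon)\gamma_d}{\|a_t\|} \right)^{1/2} \le \|a_t\| \left( 1 + \frac{R^2}{2\|a_t\|^2} + \frac{(1-\epsilon)\gamma_d}{\|a_t\|} \right) .$$

Now, by making use of $\|a_t\| \ge \gamma_d t$, we observe that

$$\|a_{t+1}\| - \|a_t\| \le \frac{R^2}{2\|a_t\|} + (1-\epsilon)\gamma_d \le \frac{R^2}{2\gamma_d}\frac{1}{t} + (1-\epsilon)\gamma_d .$$

A repeated application of the above inequality $t - N$ times ($t > N \ge 1$) gives

$$\|a_t\| - \|a_N\| \le \frac{R^2}{2\gamma_d}\sum_{k=N}^{t-1} k^{-1} + (1-\epsilon)\gamma_d (t-N) \le \frac{R^2}{2\gamma_d}\left( \frac{1}{N} + \int_N^t k^{-1}\,dk \right) + (1-\epsilon)\gamma_d (t-N)$$

from where using the obvious bound $\|a_N\| \le RN$ we get an upper bound on $\|a_t\|$

$$\|a_t\| \le \frac{R^2}{2\gamma_d}\left( \frac{1}{N} + \ln\frac{t}{N} \right) + (1-\epsilon)\gamma_d (t-N) + RN .$$

Combining the above upper bound on $\|a_t\|$, which holds not only for $t > N$ but
also for $t = N$, with the lower bound from (3) we obtain

$$t \le \frac{R^2}{2\epsilon\gamma_d^2} \left\{ \frac{1}{N} - \ln N + \frac{2\gamma_d}{R}\left(1 - (1-\epsilon)\frac{\gamma_d}{R}\right) N + \ln t \right\} .$$

Setting

$$\delta = \frac{R^2}{2\epsilon\gamma_d^2} , \qquad \alpha = \frac{2\gamma_d}{R}\left(1 - (1-\epsilon)\frac{\gamma_d}{R}\right)$$

and choosing $N = 1 + [\alpha^{-1}]$, with $[x]$ being the integer part of $x \ge 0$, we finally
get

$$t \le \delta(1 + 2\alpha + \ln\alpha + \ln t) . \tag{11}$$

Notice that in deriving (11) we made use of the fact that
$\alpha N + N^{-1} - \ln N \le 1 + 2\alpha + \ln\alpha$. Inequality (11) has the form (9) with
$C = 2\alpha + \ln\alpha$. Obviously, $e^{-C} < \alpha^{-1} \le N \le t$ and $e^{-C} < \alpha^{-1} \le \delta$.
Thus, the conditions of Lemma 1 are satisfied and the required bound, which is
of the form (10), follows from (11). □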



Finally, we arrive at our main result which is the proof of convergence of 
PDM in a finite number of steps and the derivation of the relevant upper bound. 

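Theorem 2. The number $t$ of updates of the perceptron algorithm with dynamic
margin satisfies the bound

$$t \le \begin{cases}
t_0 \left( 1 - \dfrac{1}{1-2\epsilon}\left(\dfrac{R}{\gamma_d}\right)^{2} t_0^{-1} \right)^{\frac{1}{2\epsilon}} , \quad t_0 \equiv [\epsilon^{-1}] \left(\dfrac{R}{\gamma_d}\right)^{\frac{1}{\epsilon}} \left( 1 + \dfrac{[\epsilon^{-1}]^{-1}}{1-2\epsilon} \right)^{\frac{1}{2\epsilon}} & \text{if } \epsilon < \frac{1}{2} \\[2ex]
(1 + e^{-1}) \left(\dfrac{R}{\gamma_d}\right)^{2} \ln\left( (1+e)\left(\dfrac{R}{\gamma_d}\right)^{2} \right) & \text{if } \epsilon = \frac{1}{2} \\[2ex]
t_0 \left( 1 - 2(1-\epsilon)\, t_0^{1-2\epsilon} \right) , \quad t_0 \equiv \dfrac{\epsilon(3-2\epsilon)}{2\epsilon-1}\left(\dfrac{R}{\gamma_d}\right)^{2} & \text{if } \epsilon > \frac{1}{2} .
\end{cases}$$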
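Proof. From (2) and (8) we get

$$\|a_{t+1}\|^2 = \|a_t\|^2 + 2a_t \cdot y_k + \|y_k\|^2 \le \|a_t\|^2 \left( 1 + \frac{2(1-\epsilon)}{t} \right) + R^2 . \tag{12}$$

Let us assume that $\epsilon < 1/2$. Then, using the inequality $(1+x)^{\zeta} \ge 1 + \zeta x$ for
$x \ge 0$, $\zeta = 2(1-\epsilon) \ge 1$ in (12) we obtain

$$\|a_{t+1}\|^2 \le \|a_t\|^2 \left( 1 + \frac{1}{t} \right)^{2(1-\epsilon)} + R^2$$

from where by dividing both sides with $(t+1)^{2(1-\epsilon)}$ we arrive at

$$\frac{\|a_{t+1}\|^2}{(t+1)^{2(1-\epsilon)}} - \frac{\|a_t\|^2}{t^{2(1-\epsilon)}} \le \frac{R^2}{(t+1)^{2(1-\epsilon)}} .$$

A repeated application of the above inequality $t - N$ times ($t > N \ge 1$) gives

$$\frac{\|a_t\|^2}{t^{2(1-\epsilon)}} - \frac{\|a_N\|^2}{N^{2(1-\epsilon)}} \le R^2 \sum_{k=N+1}^{t} k^{-2(1-\epsilon)} \le R^2 \int_N^t k^{-2(1-\epsilon)}\,dk = \frac{R^2 N^{2\epsilon-1}}{2\epsilon-1} \left( \left(\frac{t}{N}\right)^{2\epsilon-1} - 1 \right) . \tag{13}$$

Now, let us define

$$\alpha_t \equiv \frac{\|a_t\|}{Rt}$$

and observe that the bounds $\|a_t\| \le Rt$ and $\|a_t\| \ge \gamma_d t$ confine $\alpha_t$ to lie in the
range

$$\frac{\gamma_d}{R} \le \alpha_t \le 1 .$$

Setting $\|a_N\| = \alpha_N RN$ in (13) we get the following upper bound on $\|a_t\|^2$

$$\|a_t\|^2 \le t^{2(1-\epsilon)} \alpha_N^2 R^2 N^{2\epsilon} \left\{ 1 + \frac{\alpha_N^{-2} N^{-1}}{2\epsilon-1} \left( \left(\frac{t}{N}\right)^{2\epsilon-1} - 1 \right) \right\}$$

which combined with the lower bound $\|a_t\|^2 \ge \gamma_d^2 t^2$ leads to

$$t \le N \left( \frac{\alpha_N R}{\gamma_d} \right)^{\frac{1}{\epsilon}} \left\{ 1 + \frac{\alpha_N^{-2} N^{-1}}{2\epsilon-1} \left( \left(\frac{t}{N}\right)^{2\epsilon-1} - 1 \right) \right\}^{\frac{1}{2\epsilon}} . \tag{14}$$

For $\epsilon < 1/2$ the term proportional to $(t/N)^{2\epsilon-1}$ in (14) is negative and may be
dropped to a first approximation leading to the looser upper bound $t_0$

$$t_0 \equiv N \left( \frac{\alpha_N R}{\gamma_d} \right)^{\frac{1}{\epsilon}} \left( 1 + \frac{\alpha_N^{-2} N^{-1}}{1-2\epsilon} \right)^{\frac{1}{2\epsilon}} \tag{15}$$

on the number $t$ of updates. Then, we may replace $t$ with its upper bound $t_0$ in
the r.h.s. of (14) and get the improved bound

$$t \le t_0 \left( 1 - \frac{1}{1-2\epsilon}\left(\frac{R}{\gamma_d}\right)^{2} t_0^{-1} \right)^{\frac{1}{2\epsilon}} .$$

This is allowed given that the term proportional to $(t/N)^{2\epsilon-1}$ in (14) is negative
and moreover $t$ is raised to a negative power. Choosing $N = [\epsilon^{-1}]$ and $\alpha_N = 1$
(i.e., setting $\alpha_N$ to its upper bound which is the least favorable assumption) we
obtain the bound stated in Theorem 2 for $\epsilon < 1/2$.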

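Now, let $\epsilon > 1/2$. Then, using the inequality $(1+x)^{\zeta} + \zeta(1-\zeta)x^2/2 \ge 1 + \zeta x$
for $x \ge 0$, $0 \le \zeta = 2(1-\epsilon) \le 1$ in (12) and the bound $\|a_t\| \le Rt$ we obtain

$$\|a_{t+1}\|^2 \le \|a_t\|^2 \left( 1 + \frac{1}{t} \right)^{2(1-\epsilon)} + (1-\epsilon)(2\epsilon-1)\frac{\|a_t\|^2}{t^2} + R^2 \le \|a_t\|^2 \left( 1 + \frac{1}{t} \right)^{2(1-\epsilon)} + \epsilon(3-2\epsilon)R^2 .$$

By dividing both sides of the above inequality with $(t+1)^{2(1-\epsilon)}$ we arrive at

$$\frac{\|a_{t+1}\|^2}{(t+1)^{2(1-\epsilon)}} - \frac{\|a_t\|^2}{t^{2(1-\epsilon)}} \le \frac{\epsilon(3-2\epsilon)R^2}{(t+1)^{2(1-\epsilon)}} \tag{16}$$

a repeated application of which, using also $\|a_1\|^2 \le R^2 \le \epsilon(3-2\epsilon)R^2$, gives

$$\frac{\|a_t\|^2}{t^{2(1-\epsilon)}} \le \epsilon(3-2\epsilon)R^2 \sum_{k=1}^{t} k^{-2(1-\epsilon)} \le \epsilon(3-2\epsilon)R^2 \left( 1 + \int_1^t k^{-2(1-\epsilon)}\,dk \right) = \epsilon(3-2\epsilon)R^2 \left( 1 + \frac{t^{2\epsilon-1} - 1}{2\epsilon-1} \right) .$$

Combining the above bound with the bound $\|a_t\|^2 \ge \gamma_d^2 t^2$ we obtain

$$t^{2\epsilon} \le \epsilon(3-2\epsilon)\frac{R^2}{\gamma_d^2} \left( 1 + \frac{t^{2\epsilon-1} - 1}{2\epsilon-1} \right) \tag{17}$$

or

$$t \le \frac{\epsilon(3-2\epsilon)}{2\epsilon-1}\frac{R^2}{\gamma_d^2} \left( 1 - 2(1-\epsilon)\,t^{1-2\epsilon} \right) . \tag{18}$$

For $\epsilon > 1/2$ the term proportional to $t^{1-2\epsilon}$ in (18) is negative and may be
dropped to a first approximation leading to the looser upper bound $t_0$

$$t_0 \equiv \frac{\epsilon(3-2\epsilon)}{2\epsilon-1}\frac{R^2}{\gamma_d^2}$$

on the number $t$ of updates. Then, we may replace $t$ with its upper bound $t_0$ in
the r.h.s. of (18) and get the improved bound stated in Theorem 2 for $\epsilon > 1/2$.
This is allowed given that the term proportional to $t^{1-2\epsilon}$ in (18) is negative and
moreover $t$ is raised to a negative power.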

Finally, taking the limit $\epsilon \to 1/2$ in (14) (with $N = 1$, $\alpha_N = 1$) or in (17) we
get

$$t \le \frac{R^2}{\gamma_d^2}(1 + \ln t)$$

which on account of Lemma 1 leads to the bound of Theorem 2 for $\epsilon = 1/2$. □
Remark 1. The bound of Theorem 2 holds for PFM as well on account of (4). 
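To get a feeling for the size of the bound of Theorem 2, it can be evaluated directly from the three cases of the statement. The sketch below does so as a function of $\epsilon$ and the ratio $R/\gamma_d$; the function name `pdm_update_bound` is hypothetical, and the integer part $[\epsilon^{-1}]$ is computed with a floating-point floor, which is an implementation assumption.

```python
import math

def pdm_update_bound(eps, ratio):
    """Worst-case number of PDM updates from Theorem 2.

    eps   : accuracy parameter, 0 < eps < 1
    ratio : R / gamma_d (at least 1)
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    if eps < 0.5:
        n = math.floor(1.0 / eps)  # integer part of 1/eps
        t0 = (n * ratio ** (1.0 / eps)
              * (1.0 + 1.0 / (n * (1.0 - 2.0 * eps))) ** (1.0 / (2.0 * eps)))
        return t0 * (1.0 - ratio**2 / ((1.0 - 2.0 * eps) * t0)) ** (1.0 / (2.0 * eps))
    if eps == 0.5:
        return (1.0 + math.exp(-1.0)) * ratio**2 * math.log((1.0 + math.e) * ratio**2)
    t0 = eps * (3.0 - 2.0 * eps) / (2.0 * eps - 1.0) * ratio**2
    return t0 * (1.0 - 2.0 * (1.0 - eps) * t0 ** (1.0 - 2.0 * eps))

bound_loose = pdm_update_bound(0.75, 2.0)   # low accuracy requirement
bound_tight = pdm_update_bound(0.25, 2.0)   # higher accuracy requirement
```

Even for the modest ratio $R/\gamma_d = 2$ the bound grows quickly as $\epsilon$ decreases, in line with the $\epsilon^{-1}(R/\gamma_d)^{1/\epsilon}$ behaviour of the small-$\epsilon$ case.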

The worst-case bound of Theorem 2 for $\epsilon \ll 1$ behaves like $\epsilon^{-1}(R/\gamma_d)^{1/\epsilon}$,
which suggests an extremely slow convergence if we require margins close to the
maximum. From expression (15) for $t_0$, however, it becomes apparent that a
more favorable assumption concerning the value of $\alpha_N$ (e.g., $\alpha_N \ll 1$ or even as
low as $\alpha_N \sim \gamma_d/R$) after the first $N \gg \alpha_N^{-2}$ updates does lead to tremendous
improvement provided, of course, that $N$ is not extremely large. Such a sharp
decrease of $\|a_t\|/t$ in the early stages of the algorithm, which may be expected
from relation (6) and the discussion that followed, lies behind its experimentally
exhibited rather fast convergence.

It would be interesting to find a procedure by which the algorithm will be
forced to a guaranteed sharp decrease of the ratio $\|a_t\|/t$. The following two
observations will be vital in devising such a procedure. First, we notice that
when PDM with accuracy parameter $\epsilon$ has converged in $t_c$ updates the threshold
$(1-\epsilon)\|a_{t_c}\|^2/t_c$ of the misclassification condition must have fallen below
$\gamma_d\|a_{t_c}\|$. Otherwise, the normalized margin $u_{t_c} \cdot y_k$ of all patterns would be
larger than $\gamma_d$. Thus, $\alpha_{t_c} < (1-\epsilon)^{-1}\gamma_d/R$. Second, after convergence of the
algorithm with accuracy parameter $\epsilon_1$ in $t_{c_1}$ updates we may lower the accuracy
parameter from the value $\epsilon_1$ to the value $\epsilon_2$ and continue the run from the
point where convergence with parameter $\epsilon_1$ has taken place since for all updates
that took place during the first run the misclassified patterns would certainly
satisfy (at that time) the condition associated with the smaller parameter $\epsilon_2$.
This way, the first run is legitimately fully incorporated into the second one and
the $t_{c_1}$ updates required for convergence during the first run may be considered
the first $t_{c_1}$ updates of the second run under this specific policy of presenting
patterns to the algorithm. Combining the above two observations we see that by
employing a first run with accuracy parameter $\epsilon_1$ we force the algorithm with
accuracy parameter $\epsilon_2 < \epsilon_1$ to have $\alpha_t$ decreased from a value $\sim 1$ to a value
$\alpha_{t_{c_1}} < (1-\epsilon_1)^{-1}\gamma_d/R$ in the first $t_{c_1}$ updates.

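The above discussion suggests that we consider a decreasing sequence of
parameters $\epsilon_n$ such that $\epsilon_{n+1} = \epsilon_n/\eta$ ($\eta > 1$) starting with $\epsilon_0 = 1/2$ and ending
with the required accuracy $\epsilon$ and perform successive runs of PDM with accuracies
$\epsilon_n$ until convergence in $t_{c_n}$ updates is reached. According to our earlier discussion
$t_{c_n}$ includes the updates that led the algorithm to convergence in the current
and all previous runs. Moreover, at the end of the run with parameter $\epsilon_n$ we will
have ensured that $\alpha_{t_{c_n}} < (1-\epsilon_n)^{-1}\gamma_d/R$. Therefore, $t_{c_{n+1}}$ satisfies $t_{c_{n+1}} \le t_0$ or

$$\frac{t_{c_{n+1}}}{t_{c_n}} \le \left( \frac{1}{1-\epsilon_n} \right)^{\eta/\epsilon_n} \left( 1 + \frac{(1-\epsilon_n)^2}{1-2\epsilon_n/\eta}\,\frac{R^2}{\gamma_d^2}\,t_{c_n}^{-1} \right)^{\eta/2\epsilon_n} .$$

This is obtained by substituting in (15) the values $\epsilon = \epsilon_{n+1} = \epsilon_n/\eta$, $N = t_{c_n}$ and
$\alpha_N = (1-\epsilon_n)^{-1}\gamma_d/R$ which is the least favorable choice for $\alpha_{t_{c_n}}$. Let us assume
that $\epsilon_n \ll 1$ and set $t_{c_n} = \xi_n^{-1}R^2/\gamma_d^2$ with $\xi_n \ll 1$. Then,
$1/(1-\epsilon_n)^{\eta/\epsilon_n} \simeq e^{\eta}$ and

$$\left( 1 + \frac{(1-\epsilon_n)^2}{1-2\epsilon_n/\eta}\,\frac{R^2}{\gamma_d^2}\,t_{c_n}^{-1} \right)^{\eta/2\epsilon_n} \simeq (1+\xi_n)^{\eta/2\epsilon_n} \simeq e^{\eta\xi_n/2\epsilon_n} .$$

For $\xi_n \sim \epsilon_n$ the term above becomes approximately $e^{\eta/2}$ while for $\xi_n \ll \epsilon_n$ it
approaches 1. We see that under the assumption that PDM with accuracy
parameter $\epsilon_n$ converges in a number of updates $\gg R^2/\gamma_d^2$ the ratio $t_{c_{n+1}}/t_{c_n}$ in the
successive run scenario is rather tightly constrained. If, instead, our assumption
is not satisfied then convergence of the algorithm is fast anyway. Notice that the
value of $t_{c_{n+1}}/t_{c_n}$ inferred from the bound of Theorem 2 is $\sim (R/\gamma_d)^{(\eta-1)/\epsilon_n}$,
which is extremely large. We conclude that PDM employing the successive run
scenario (PDM-succ) potentially converges in a much smaller number of steps.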
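The successive run scenario can be sketched as a warm-started loop over a decreasing schedule of accuracy parameters. In the sketch below the names `epsilon_schedule` and `pdm_succ` are hypothetical, the final accuracy is assumed to be at most 1/2, and the plain PDM loop is used without any of the efficiency mechanisms of Sect. 4.

```python
import numpy as np

def epsilon_schedule(eps, eta):
    """Accuracies 1/2, 1/(2 eta), ... divided by eta each time, ending with eps."""
    schedule = [0.5]
    while schedule[-1] / eta > eps:
        schedule.append(schedule[-1] / eta)
    if schedule[-1] > eps:
        schedule.append(eps)   # finish with the required accuracy
    return schedule

def pdm_succ(Y, eps, eta, max_epochs=100000):
    """Successive PDM runs; each run continues from the previous a_t and t."""
    a = np.zeros(Y.shape[1])
    t = 0
    for eps_n in epsilon_schedule(eps, eta):
        for _ in range(max_epochs):
            updated = False
            for y in Y:
                threshold = (1.0 - eps_n) * (a @ a) / t if t > 0 else 0.0
                if a @ y <= threshold:
                    a = a + y
                    t += 1
                    updated = True
            if not updated:
                break          # converged for eps_n, move to the next accuracy
        else:
            raise RuntimeError("no convergence within max_epochs")
    return a, t
```

For example, `epsilon_schedule(0.01, 8)` yields the three accuracies 0.5, 0.0625 and 0.01, and the update counter `t` returned by `pdm_succ` accumulates the updates of all runs, as in the discussion above.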



4 Efficient Implementation 

To reduce the computational cost involved in running PDM, we extend the
procedure of [14, 13] and construct a three-member nested sequence of reduced
"active sets" of data points. As we cycle once through the full dataset, the
(largest) first-level active set is formed from the points of the full dataset
satisfying $a_t \cdot y_k \le c_1(1-\epsilon)\|a_t\|^2/t$ with $c_1 = 2.2$. Analogously, the second-level
active set is formed as we cycle once through the first-level active set from the
points which satisfy $a_t \cdot y_k \le c_2(1-\epsilon)\|a_t\|^2/t$ with $c_2 = 1.1$. The third-level
active set comprises the points that satisfy $a_t \cdot y_k \le (1-\epsilon)\|a_t\|^2/t$ as we cycle
once through the second-level active set. The third-level active set is presented
repetitively to the algorithm for $N_{ep_3}$ mini-epochs. Then, the second-level active
set is presented $N_{ep_2}$ times. During each round involving the second-level set, a
new third-level set is constructed and a new cycle of $N_{ep_3}$ passes begins. When
the number of $N_{ep_2}$ cycles involving the second-level set is reached the first-level
set becomes active again leading to the population of a new second-level active
set. By invoking the first-level set for the $(N_{ep_1}+1)$-th time, we trigger the loading
of the full dataset and the procedure starts all over again until no point is found
misclassified among the ones comprising the full dataset. Of course, the $N_{ep_1}$,
$N_{ep_2}$ and $N_{ep_3}$ rounds are not exhausted if no update takes place during a round.
In all experiments we choose $N_{ep_1} = 9$, $N_{ep_2} = N_{ep_3} = 12$. In addition, every
time we make use of the full dataset we actually employ a permuted instance
of it. Evidently, the whole procedure amounts to a different way of sequentially
presenting the patterns to the algorithm and does not affect the applicability of
our theoretical analysis. A completely analogous procedure is followed for PFM.

An additional mechanism providing a substantial improvement of the
computational efficiency is the one of performing multiple updates [14, 13] once a data
point is presented to the algorithm. It is understood, of course, that in order for a
multiple update to be compatible with our theoretical analysis it should be
equivalent to a certain number of updates occurring as a result of repeatedly presenting
to the algorithm the data point in question. For PDM when a pattern $y_k$ is found
to satisfy the misclassification condition (8) we perform $\lambda = [\mu_+] + 1$ updates at
once. Here, $\mu_+$ is the smallest non-negative root of the quadratic equation in the
variable $\mu$ derivable from the relation $(t+\mu)\,a_{t+\mu} \cdot y_k - (1-\epsilon)\|a_{t+\mu}\|^2 = 0$ in
which $a_{t+\mu} \cdot y_k = a_t \cdot y_k + \mu\|y_k\|^2$ and
$\|a_{t+\mu}\|^2 = \|a_t\|^2 + 2\mu\,a_t \cdot y_k + \mu^2\|y_k\|^2$.

Thus, we require that as a result of the multiple update the pattern violates the
misclassification condition. Similarly, we perform multiple updates for PFM.
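Expanding the relation above in powers of $\mu$, with $A = \|a_t\|^2$, $b = a_t \cdot y_k$ and $c = \|y_k\|^2$, gives the quadratic $\epsilon c\,\mu^2 + (tc - (1-2\epsilon)b)\,\mu + (tb - (1-\epsilon)A) = 0$. The sketch below computes the number of updates from it; the function name `multiple_update_count` is hypothetical, and the code assumes $t > 0$ and that the pattern currently satisfies condition (8), so that the constant term is non-positive.

```python
import math

def multiple_update_count(a_dot_y, a_sq, y_sq, t, eps):
    """Number of simultaneous updates [mu_+] + 1 for PDM.

    a_dot_y = a_t . y_k, a_sq = ||a_t||^2, y_sq = ||y_k||^2.
    Assumes t > 0 and a_dot_y <= (1 - eps) * a_sq / t (condition (8)).
    """
    qa = eps * y_sq
    qb = t * y_sq - (1.0 - 2.0 * eps) * a_dot_y
    qc = t * a_dot_y - (1.0 - eps) * a_sq   # <= 0 under condition (8)
    root = math.sqrt(qb * qb - 4.0 * qa * qc)
    candidates = [(-qb - root) / (2.0 * qa), (-qb + root) / (2.0 * qa)]
    mu_plus = min(r for r in candidates if r >= 0.0)
    return math.floor(mu_plus) + 1

# a_t = (4, 0), t = 2, y_k = (0, 1), eps = 0.1
updates = multiple_update_count(a_dot_y=0.0, a_sq=16.0, y_sq=1.0, t=2, eps=0.1)
```

In this example six updates are performed at once: after five the pattern would still satisfy (8), whereas after six it violates it, which is exactly the requirement stated above.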

Finally, in the case of PDM (no successive runs) when we perform multiple
updates we start doing so after the first full epoch. This way, we avoid the
excessive growth of the length of the weight vector due to the contribution to
the solution of many aligned patterns in the early stages of the algorithm which
hinders the fast decrease of $\|a_t\|/t$. Moreover, in this scenario when we select
the first-level active set as we go through the full dataset for the first time (first
full epoch) we found it useful to set $c_1 = c_2 = 1.1$ instead of $c_1 = 2.2$.

5 Experimental Evaluation 

We compare PDM with several other large margin classifiers on the basis of their
ability to achieve fast convergence to a certain approximation of the "optimal"
hyperplane in the feature space where the patterns are linearly separable. For
linearly separable data the feature space is the initial instance space whereas for
inseparable data (which is the case here) a space extended by as many dimensions
as the instances is considered where each instance is placed at a distance $\Delta$
from the origin in the corresponding dimension^2 [4]. This extension generates a
margin of at least $\Delta/\sqrt{m}$. Moreover, its employment relies on the well-known
equivalence between the hard margin optimization in the extended space and
the soft margin optimization in the initial instance space with objective function
$\|w\|^2 + \Delta^{-2}\sum_i \xi_i^2$ involving the weight vector $w$ and the 2-norm of the slacks
$\xi_i$ [2]. Of course, all algorithms are required to solve identical hard margin problems.

^2 $y_k = [\bar{y}_k, l_k\Delta\delta_{1k}, \ldots, l_k\Delta\delta_{mk}]$, where $\delta_{ij}$ is Kronecker's $\delta$ and $\bar{y}_k$ the projection
of the $k$-th extended instance $y_k$ (multiplied by its label $l_k$) onto the initial instance
space. The feature space mapping defined by the extension commutes with a possible
augmentation (with parameter $\rho$) in which case $\bar{y}_k = [l_k x_k, l_k\rho]$. Here $x_k$ represents
the $k$-th data point.

The datasets we used for training are: the Adult (m = 32561 instances,
n = 123 attributes) and Web (m = 49749, n = 300) UCI datasets as compiled
by Platt [15], the training set of the KDD04 Physics dataset (m = 50000,
n = 70 after removing the 8 columns containing missing features) obtainable
from http://kodiak.cs.cornell.edu/kddcup/datasets.html, the Real-sim
(m = 72309, n = 20958), News20 (m = 19996, n = 1355191) and Webspam
(unigram treatment with m = 350000, n = 254) datasets all available at
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets, the multiclass
Covertype UCI dataset (m = 581012, n = 54) and the full Reuters RCV1 dataset
(m = 804414, n = 47236) obtainable from
http://www.jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.
For the Covertype dataset we study the binary classification problem of the first
class versus rest while for the RCV1 we consider both the binary text
classification tasks of the C11 and CCAT classes versus rest. The Physics and Covertype
datasets were rescaled by a multiplicative factor 0.001. The experiments were
conducted on a 2.5 GHz Intel Core 2 Duo processor with 3 GB RAM running
Windows Vista. Our codes written in C++ were compiled using the g++ compiler
under Cygwin.

The parameter $\Delta$ of the extended space is chosen from the set {3, 1, 0.3, 0.1}
in such a way that it corresponds approximately to R/10 or R/3 depending on
the size of the dataset such that the ratio $\gamma_d/R$ does not become too small (given
that the extension generates a margin of at least $\Delta/\sqrt{m}$). More specifically, we
have chosen $\Delta = 3$ for Covertype, $\Delta = 1$ for Adult, Web and Physics, $\Delta = 0.3$
for Webspam, C11 and CCAT and $\Delta = 0.1$ for Real-sim and News20. We also
verified that smaller values of $\Delta$ do not lead to a significant decrease of the
training error. For all datasets and for algorithms that introduce bias through
augmentation the associated parameter $\rho$ was set to the value $\rho = 1$.

We begin our experimental evaluation by comparing PDM with PFM. We
run PDM with accuracy parameter $\epsilon = 0.01$ and subsequently PFM with the
fixed margin $\beta = (1-\epsilon)\gamma_d$ set to the value $\gamma'_d$ of the directional margin achieved
by PDM. This procedure is repeated using PDM-succ with step $\eta = 8$ (i.e.,
$\epsilon_0 = 0.5$, $\epsilon_1 = 0.0625$, $\epsilon_2 = \epsilon = 0.01$). Our results (the value of the directional
margin $\gamma'_d$ achieved, the number of required updates (upd) for convergence and
the CPU time for training in seconds (s)) are presented in Table 1. We see that
PDM is considerably faster than PFM as far as training time is concerned in spite
of the fact that PFM needs far fewer updates for convergence. The successive run
scenario succeeds, in accordance with our expectations, in reducing the number
of updates to the level of the updates needed by PFM in order to achieve the
same value of $\gamma'_d$ at the expense of an increased runtime. We believe that it
is fair to say that PDM-succ with $\eta = 8$ has the overall performance of PFM
without the defect of the need for a priori knowledge of the value of $\gamma_d$. We also
notice that although the accuracy $\epsilon$ is set to the same value for both scenarios
the margin achieved with successive runs is lower. This is an indication that
PDM-succ obtains a better estimate of the maximum directional margin $\gamma_d$.

Table 1. Results of an experimental evaluation comparing the algorithms PDM
and PDM-succ with PFM. Each PFM block uses the margin achieved by the
algorithm to its left as its fixed margin.

| data set  | PDM ε = 0.01: 10^4 γ'_d | 10^-6 upd | s     | PFM: 10^-6 upd | s      | PDM-succ ε = 0.01: 10^4 γ'_d | 10^-6 upd | s      | PFM: 10^-6 upd | s      |
|-----------|-------------------------|-----------|-------|----------------|--------|------------------------------|-----------|--------|----------------|--------|
| Adult     | 84.57                   | 27.43     | 3.7   | 10.70          | 7.3    | 84.46                        | 9.312     | 5.3    | 9.367          | 6.6    |
| Web       | 209.6                   | 739.4     | 0.8   | 1.089          | 0.9    | 209.1                        | 0.838     | 0.9    | 0.871          | 0.8    |
| Physics   | 44.54                   | 9.449     | 10.4  | 6.021          | 13.8   | 44.53                        | 5.984     | 15.3   | 6.006          | 13.8   |
| Real-sim  | 39.93                   | 15.42     | 13.6  | 12.69          | 35.7   | 39.74                        | 5.314     | 13.8   | 5.306          | 14.3   |
| News20    | 91.90                   | 2.403     | 27.4  | 1.060          | 55.6   | 91.68                        | 0.814     | 47.7   | 0.813          | 43.7   |
| Webspam   | 10.05                   | 331.0     | 197.5 | 108.4          | 348.0  | 10.03                        | 89.72     | 247.0  | 89.60          | 264.5  |
| Covertype | 47.51                   | 189.7     | 86.6  | 68.86          | 156.0  | 47.48                        | 66.03     | 146.1  | 64.41          | 142.5  |
| C11       | 13.81                   | 148.6     | 156.3 | 75.26          | 895.1  | 13.77                        | 49.02     | 612.4  | 49.22          | 557.5  |
| CCAT      | 9.279                   | 307.7     | 310.6 | 151.2          | 1923.5 | 9.253                        | 107.8     | 1389.8 | 107.8          | 1601.0 |

We also considered other large margin classifiers representing classes of
algorithms such as perceptron-like algorithms, decomposition SVMs and linear
SVMs with the additional requirement that the chosen algorithms need only
specification of an accuracy parameter. From the class of perceptron-like
algorithms we have chosen (aggressive) ROMMA which is much faster than ALMA
in the light of the results presented in [9,14]. Decomposition SVMs are
represented by SVM^light [7] which, apart from being one of the fastest algorithms
of this class, has the additional advantage of making very efficient use of
memory, thereby making possible the training on very large datasets. Finally, from
the more recent class of linear SVMs we have included in our study the dual
coordinate descent (DCD) algorithm [8] and the margin perceptron with
unlearning (MPU)^3 [13]. We considered the DCD versions with 1-norm (DCD-L1)
and 2-norm (DCD-L2) soft margin which for the same value of the accuracy
parameter produce identical solutions if the penalty parameter is $C = \infty$
for DCD-L1 and $C = 1/(2\Delta^2)$ for DCD-L2. The source for SVM^light (version
6.02) is available at http://svmlight.joachims.org and for DCD at
http://www.csie.ntu.edu.tw/~cjlin/liblinear. The absence of publicly available
implementations for ROMMA necessitated the writing of our own code in C++
employing the mechanism of active sets proposed in [9] and incorporating a
mechanism of permutations performed at the beginning of a full epoch. For MPU
the implementation followed closely [13] with active set parameters c = 1.01, 
iVepj = VVep2 = 5, gap parameter 6b = iB? and early stopping. 
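
The stated correspondence between the two penalty values follows from the
standard reduction of a 2-norm soft margin problem to a hard margin problem in
an extended space. A brief sketch, assuming the usual objective with squared
slacks ξ_i and an extension that appends Δ y_i e_i to pattern x_i (e_i being
the i-th unit vector); the slack variables and this particular form of the
extension are assumptions, since they are not spelled out in this section:

```latex
\begin{align*}
\text{2-norm soft margin:}\quad
  & \min_{w,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i^2
    \quad\text{s.t.}\quad y_i\, w\cdot x_i \ge 1 - \xi_i \\
\text{extended space:}\quad
  & \tilde{x}_i = (x_i,\ \Delta\, y_i e_i), \qquad
    \tilde{w} = (w,\ \xi/\Delta) \\
\text{so that}\quad
  & y_i\, \tilde{w}\cdot\tilde{x}_i = y_i\, w\cdot x_i + \xi_i \ge 1, \qquad
    \tfrac{1}{2}\|\tilde{w}\|^2
      = \tfrac{1}{2}\|w\|^2 + \frac{1}{2\Delta^2}\sum_i \xi_i^2
\end{align*}
```

Matching the slack coefficients gives C = 1/(2Δ²). Running a 1-norm soft margin
solver with C = ∞ on the already extended patterns forbids any further slack,
which is why the two DCD variants are expected to reach the same solution under
these settings.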

The experimental results (margin values achieved and training runtimes) 
involving the above algorithms with the accuracy parameter set to 0.01 for all of 

* MPU uses dual variables but is not formulated as an optimization. It is a perceptron
incorporating a mechanism of reduction of possible contributions from "very-well 
classified" patterns to the weight vector which is an essential ingredient of SVMs. 



Table 2. Results of experiments with ROMMA, SVM^light, DCD-L1, DCD-L2
and MPU algorithms. The accuracy parameter for all algorithms is set to 0.01.

data set     ROMMA               SVM^light           DCD-L1           DCD-L2  MPU
             10^4 γ_d         s   10^4 γ'         s  10^4 γ_d      s       s  10^4 γ_d      s
Adult           84.66     275.8     84.90     414.2     84.95    0.6     0.5     84.61    0.8
Web             209.6      52.6     209.4      40.3     209.5    0.7     0.6     209.5    0.3
Physics         44.57     117.7     44.60    2341.8     44.57   22.5    20.0     44.62    4.9
Real-sim        39.89    1318.8     39.80     146.5     39.81    6.4     5.6     39.78    3.3
News20          92.01    4754.0     91.95     113.8     92.17   48.1    47.1     91.62   15.8
Webspam         10.06   39760.6     10.07   29219.4     10.08   37.5    33.0     10.06   28.2
Covertype       47.54   43282.0     47.73   48460.3     47.71   18.1    15.0     47.67   18.7
C11             13.82  146529.2     13.82   20127.8     13.83   30.7    27.2     13.79   20.2
CCAT            9.290  298159.4     9.291   83302.4     9.303   51.9    46.2     9.264   36.1

them are summarized in Table 2. Notice that for SVM^light we give the geometric
margin γ' instead of the directional one γ_d because SVM^light does not
introduce bias through augmentation. For the rest of the algorithms considered,
including PDM and PFM, the geometric margin γ' achieved is not listed in the
tables since it is very close to the directional margin γ_d if the augmentation
parameter ρ is set to the value ρ = 1. Moreover, for DCD-L1 and DCD-L2 the
margin values coincide, as we pointed out earlier. From Table 2 it is apparent
that ROMMA and SVM^light are orders of magnitude slower than DCD and MPU.
Comparing the results of Table 1 with those of Table 2 we see that PDM is
orders of magnitude faster than ROMMA, which is its natural competitor since
they both belong to the class of perceptron-like algorithms. PDM is also much
faster than SVM^light but statistically a few times slower than DCD, especially
for the larger datasets. Moreover, PDM is a few times slower than MPU for all
datasets. Finally, we observe that the accuracy achieved by PDM is, in general,
closer to the before-run accuracy 0.01 since in most cases PDM obtains lower
margin values. This indicates that PDM succeeds in obtaining a better estimate
of the maximum margin than the remaining algorithms, with the possible
exception of MPU.
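
The runtime gaps described above can be quantified directly from Table 2. The
snippet below is only a convenience calculation on the reported runtimes (the
dictionary layout and the helper name slowdown are illustrative choices, not
part of the original study); it reports how many times slower ROMMA and
SVM^light are than MPU on each dataset.

```python
# Training runtimes in seconds, copied from Table 2.
RUNTIMES = {
    "Adult":     {"ROMMA": 275.8,    "SVMlight": 414.2,   "DCD-L1": 0.6,  "DCD-L2": 0.5,  "MPU": 0.8},
    "Web":       {"ROMMA": 52.6,     "SVMlight": 40.3,    "DCD-L1": 0.7,  "DCD-L2": 0.6,  "MPU": 0.3},
    "Physics":   {"ROMMA": 117.7,    "SVMlight": 2341.8,  "DCD-L1": 22.5, "DCD-L2": 20.0, "MPU": 4.9},
    "Real-sim":  {"ROMMA": 1318.8,   "SVMlight": 146.5,   "DCD-L1": 6.4,  "DCD-L2": 5.6,  "MPU": 3.3},
    "News20":    {"ROMMA": 4754.0,   "SVMlight": 113.8,   "DCD-L1": 48.1, "DCD-L2": 47.1, "MPU": 15.8},
    "Webspam":   {"ROMMA": 39760.6,  "SVMlight": 29219.4, "DCD-L1": 37.5, "DCD-L2": 33.0, "MPU": 28.2},
    "Covertype": {"ROMMA": 43282.0,  "SVMlight": 48460.3, "DCD-L1": 18.1, "DCD-L2": 15.0, "MPU": 18.7},
    "C11":       {"ROMMA": 146529.2, "SVMlight": 20127.8, "DCD-L1": 30.7, "DCD-L2": 27.2, "MPU": 20.2},
    "CCAT":      {"ROMMA": 298159.4, "SVMlight": 83302.4, "DCD-L1": 51.9, "DCD-L2": 46.2, "MPU": 36.1},
}


def slowdown(dataset, slow, fast):
    """Return how many times longer `slow` ran than `fast` on `dataset`."""
    row = RUNTIMES[dataset]
    return row[slow] / row[fast]


for name in RUNTIMES:
    romma = slowdown(name, "ROMMA", "MPU")
    svml = slowdown(name, "SVMlight", "MPU")
    print(f"{name:10s} ROMMA/MPU = {romma:8.1f}   SVMlight/MPU = {svml:8.1f}")
```

On these numbers the ratios vary widely across datasets, from roughly 7
(SVM^light on News20) to more than 8000 (ROMMA on CCAT).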

Before we conclude our comparative study it is fair to point out that PDM 
is not the fastest perceptron-like large margin classifier. From the results of [14] 
the fastest algorithm of this class is the margitron which has strong before-run 
guarantees and a very good after-run estimate of the achieved accuracy through 
(5). However, its drawback is that an approximate knowledge of the value of γ_d
(preferably an upper bound) is required in order to fix the parameter controlling 
the margin threshold. Although there is a procedure to obtain this information, 
taking all the facts into account the employment of PDM seems preferable. 



6 Conclusions 



We introduced the perceptron with dynamic margin (PDM), a new approximate
maximum margin classifier employing the classical perceptron update, demon- 
strated its convergence in a finite number of steps and derived an upper bound on 
them. PDM uses the required accuracy as the only input parameter. Moreover, 
it is a strictly online algorithm in the sense that it decides whether to perform 
an update taking into account only its current state and irrespective of whether 
the pattern presented to it has been encountered before in the process of cycling 
repeatedly through the dataset. This certainly does not hold for linear SVMs. 
Our experimental results indicate that PDM is the fastest large margin classifier 
enjoying the above two very desirable properties. 
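
The update rule summarized above can be sketched in a few lines. The following
is a minimal illustration, not the authors' implementation: it assumes plain
(non-augmented) pattern vectors, takes the "certain fraction" of the dynamic
bound ||a_t||/t to be (1 - ε), and stops after the first full epoch without
updates. The names pdm_train, min_margin, eps and max_epochs are hypothetical.

```python
import math


def pdm_train(patterns, labels, eps, max_epochs=10000):
    """Perceptron with a dynamic margin threshold (illustrative sketch).

    A pattern triggers an update whenever its normalized margin
    y * (a . x) / ||a|| does not exceed (1 - eps) * ||a|| / t, where t is the
    number of updates so far. The factor (1 - eps) is an assumption.
    """
    a = [0.0] * len(patterns[0])   # weight vector
    t = 0                          # total number of updates
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(patterns, labels):
            score = y * sum(ai * xi for ai, xi in zip(a, x))
            norm_sq = sum(ai * ai for ai in a)
            # Same test as above with both sides multiplied by ||a||.
            # Before any update (t == 0) the threshold is taken to be 0.
            threshold = (1.0 - eps) * norm_sq / t if t > 0 else 0.0
            if score <= threshold:
                # Classical perceptron update.
                a = [ai + y * xi for ai, xi in zip(a, x)]
                t += 1
                updated = True
        if not updated:            # a full epoch without updates: stop
            break
    return a, t


def min_margin(a, patterns, labels):
    """Smallest normalized margin of the patterns with respect to a."""
    norm = math.sqrt(sum(ai * ai for ai in a))
    return min(y * sum(ai * xi for ai, xi in zip(a, x)) / norm
               for x, y in zip(patterns, labels))


# Toy linearly separable data, made up for this example.
X = [(2, 1), (1, 2), (3, 3), (-1, -2), (-2, -1), (-3, -3)]
Y = [1, 1, 1, -1, -1, -1]
a, t = pdm_train(X, Y, eps=0.1)
print(a, t, min_margin(a, X, Y))
```

The decision to update depends only on the current state (a, t) and the
pattern at hand, which is the sense in which the rule is strictly online. When
the loop stops, every pattern has normalized margin above (1 - ε)||a||/t; since
||a_t||/t upper-bounds the maximum margin, the returned vector attains at least
a (1 - ε) fraction of it under the assumptions of this sketch.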

References 

1. Blum, A.: Lectures on machine learning theory. Carnegie Mellon University, USA.
Available at http://www.cs.cmu.edu/~avrim/ML09/lect0126.pdf

2. Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines (2000)
Cambridge, UK: Cambridge University Press

3. Duda, R.O., Hart, P.E.: Pattern classification and scene analysis (1973) Wiley

4. Freund, Y., Schapire, R.E.: Large margin classification using the perceptron
algorithm. Machine Learning 37(3) (1999) 277-296

5. Gentile, C.: A new approximate maximal margin classification algorithm. Journal
of Machine Learning Research 2 (2001) 213-242

6. Joachims, T.: Making large-scale SVM learning practical. In Advances in kernel
methods-support vector learning (1999) MIT Press

7. Joachims, T.: Training linear SVMs in linear time. KDD (2006) 217-226

8. Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S.S., Sundararajan, S.: A dual
coordinate descent method for large-scale linear SVM. ICML (2008) 408-415

9. Ishibashi, K., Hatano, K., Takeda, M.: Online learning of maximum p-norm margin
classifiers. COLT (2008) 69-80

10. Krauth, W., Mézard, M.: Learning algorithms with optimal stability in neural
networks. Journal of Physics A20 (1987) L745-L752

11. Li, Y., Long, P.: The relaxed online maximum margin algorithm. Machine Learning
46(1-3) (2002) 361-387

12. Novikoff, A.B.J.: On convergence proofs on perceptrons. In Proc. Symp. Math.
Theory Automata, Vol. 12 (1962) 615-622

13. Panagiotakopoulos, C., Tsampouka, P.: The margin perceptron with unlearning.
ICML (2010) 855-862

14. Panagiotakopoulos, C., Tsampouka, P.: The margitron: A generalized perceptron
with margin. IEEE Transactions on Neural Networks 22(3) (2011) 395-407

15. Platt, J.C.: Sequential minimal optimization: A fast algorithm for training support
vector machines. Microsoft Res. Redmond WA, Tech. Rep. MSR-TR-98-14 (1998)

16. Rosenblatt, F.: The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review 65(6) (1958) 386-408

17. Tsampouka, P., Shawe-Taylor, J.: Perceptron-like large margin classifiers.
Tech. Rep., ECS, University of Southampton, UK (2005). Obtainable from
http://eprints.ecs.soton.ac.uk/10657

18. Tsampouka, P., Shawe-Taylor, J.: Analysis of generic perceptron-like large margin
classifiers. ECML (2005) 750-758

19. Tsampouka, P., Shawe-Taylor, J.: Constant rate approximate maximum margin
algorithms. ECML (2006) 437-448

20. Tsampouka, P., Shawe-Taylor, J.: Approximate maximum margin algorithms with
rules controlled by the number of mistakes. ICML (2007) 903-910

21. Vapnik, V.: Statistical learning theory (1998) Wiley



